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A broad introduction to RNA-Seq 


Felix Richter 


Abstract 


RNA is a nucleic acid, like DNA, with many fundamental biological roles in when, where, and by how much genes 
are turned on. Classically, sections of DNA are copied to RNA which are decoded into proteins that carry out cel- 
lular functions, but RNAs also have many roles that fall outside of this framework. RNA-Seq is a technique that is 
used to obtain snapshots of this continuously changing RNA landscape within a cell, with broad applications across 
the life sciences from agriculture to medicine. RNA-Seq is typically used to analyze the amount of each gene's 
RNA in experimental samples (i.e., gene expression), as well as changes made during RNA processing (e.g., alter- 
native splicing, editing, mutations, or fusions between multiple RNAs). RNA-Seq requires molecular biology and 
computational steps, which are described in this review. Recent advances in RNA-Seq include the ability to study 


single cells and entire single RNA molecules. 
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Introduction 


RNA-Seq, named as an abbreviation of "RNA sequenc- 
ing" and sometimes spelled RNA-seq, RNAseq, or 
RNASegq, uses next-generation sequencing (NGS) to re- 
veal the presence and quantity of ribonucleic acid (RNA) 
in a biological sample at a given moment. "J?! 


RNA-Seq is used to analyze the continuously changing 
cellular transcriptome (Figure 1). Specifically, RNA-Seq 
facilitates the ability to look at alternative gene spliced 
transcripts, post-transcriptional modifications, gene fu- 
sion, mutations/single nucleotide polymorphisms 
(SNPs) and changes in gene expression over time, or 
differences in gene expression in different groups or 
treatments."! In addition to messenger RNA (mRNA) 
transcripts, RNA-Seq can look at different populations 
of RNA to include total RNA, small RNA, such as mi- 
croRNA (miRNA), transfer RNA (tRNA), and ribosomal 
profiling.“! RNA-Seq can also be used to determine 
exon/intron boundaries and verify or amend previously 
annotated 5' and 3' gene boundaries. Recent advances 
in RNA-Seq include single cell sequencing, in situ se- 
quencing of fixed tissue, and native RNA molecule se- 
quencing with single-molecule real-time sequencing.” 
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Prior to RNA-Seq, gene expression studies were done 
with hybridization-based microarrays. Issues with mi- 
croarrays include cross-hybridization artifacts, poor 
quantification of lowly and highly expressed genes, and 
needing to know the sequence a priori.'*! Because of 
these technical issues, transcriptomics transitioned to 
sequencing-based methods. These progressed from 
Sanger sequencing of Expressed Sequence Tag librar- 
ies, to chemical tag-based methods (e.g., serial analysis 
of gene expression), and finally to the current technol- 
ogy, next-gen sequencing of complementary DNA 
(cDNA), notably RNA-Seq. 


Library preparation 
See also: w:Library (biology) 


The general steps to prepare a complementary DNA 
(cDNA) library for sequencing are described below, but 
often vary between platforms. 1/201 


1. RNA Isolation: RNA is isolated from tissue and 
mixed with deoxyribonuclease (DNase). DNase re- 
duces the amount of genomic DNA. The amount of 
RNA degradation is checked with gel and capillary 
electrophoresis and is used to assign an RNA integ- 
rity number to the sample. This RNA quality and the 
total amount of starting RNA are taken into consid- 
eration during the subsequent library preparation, 
sequencing, and analysis steps. 

2. RNA selection/depletion: To analyze signals of inter- 


est, the isolated RNA can either be kept as is, fil- 
tered for RNA with 3' polyadenylated (poly(A)) tails 


to include only mRNA, depleted of ribosomal RNA 
(rRNA), and/or filtered for RNA that binds specific 
sequences (RNA selection and depletion methods 
table, below). The RNA with 3' poly(A) tails are 
mainly composed of mature, processed, coding se- 
quences. Poly(A) selection is performed by mixing 
RNA with poly(T) oligomers covalently attached to 
a substrate, typically magnetic beads."4J22] Poly(A) 
selection has important limitations in RNA biotype 
detection. Many RNA biotypes are not polyadenyl- 
ated, including many noncoding RNA and histone- 
core protein transcripts, or are regulated via their 
poly(A) tail length (e.g., cytokines) and thus might 
not be detected after poly(A) selection.“*) Further- 
more, poly(A) selection may increased 3' bias, es- 
pecially with lower quality RNA.“4105! These limita- 
tions can be avoided with ribosomal depletion, re- 
moving rRNA that typically represents over 90% of 
the RNA in a cell. Both poly(A) enrichment and ri- 
bosomal depletion steps are labor intensive and 
could introduce biases, so more simple approaches 
have been developed to omit these steps."° Small 
RNA targets, such as miRNA, can be further iso- 
lated through size selection with exclusion gels, 
magnetic beads, or commercial kits. 


3. cDNAsynthesis: RNA is reverse transcribed to 
cDNA because DNA is more stable and to allow for 
amplification (which uses DNA polymerases) and 
leverage more mature DNA sequencing technol- 
ogy. Amplification subsequent to reverse transcrip- 
tion results in loss of strandedness, which can be 
avoided with chemical labeling or single molecule 
sequencing. Fragmentation and size selection are 
performed to purify sequences that are the appro- 
priate length for the sequencing machine. The 
RNA, cDNA, or both are fragmented with enzymes, 
sonication, or nebulizers. Fragmentation of the 
RNA reduces 5' bias of randomly primed-reverse 
transcription and the influence of primer binding 
sites, 2] with the downside that the 5' and 3' ends 
are converted to DNA less efficiently. Fragmenta- 
tion is followed by size selection, where either 
small sequences are removed or a tight range of se- 
quence lengths are selected. Because small RNAs 
like miRNAs are lost, these are analyzed inde- 
pendently. The cDNA for each experiment can be 
indexed with a hexamer or octamer barcode, so 
that these experiments can be pooled into a single 
lane for multiplexed sequencing. 


Sequencing 
See also: w:DNA sequencing 


The cDNA library derived from RNA biotypes is then se- 
quenced into a computer-readable format. There are 
many high-throughput sequencing technologies for 
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cDNA sequencing including platforms developed by II- 
lumina, Thermo Fisher, BGI/MGI, PacBio, and Oxford 
Nanopore Technologies." For Illumina short-read se- 
quencing, acommon technology for cDNA sequencing, 
adapters are ligated to the cDNA, DNA is attached to a 
flow cell, clusters are generated through cycles of 
bridge amplification and denaturing, and sequence-by- 
synthesis is performed in cycles of complementary 
strand synthesis and laser excitation of bases with re- 
versible terminators. Sequencing platform choice and 
parameters are guided by experimental design and 
cost. Common experimental design considerations in- 
clude deciding on the sequencing length, sequencing 
depth, use of single versus paired-end sequencing, 
number of replicates, multiplexing, randomization, and 
spike-ins."®) 


Single-molecule real-time RNA sequencing 
See also: w:Single-molecule real-time sequencing 


Massively parallel single molecule direct RNA-Seq has 
been explored as an alternative to traditional RNA-Seq, 
in which RNA-to-cDNA conversion, ligation, ampli- 
fcation, and other sample manipulation steps may in- 
troduce biases and artifacts."°! Technology platforms 
that perform single-molecule real-time RNA-Seq in- 
clude Oxford Nanopore Technologies (ONT) Nanopore 
sequencing,'7°! PacBio IsoSeq, and Helicos (bankrupt). 
Sequencing RNA in its native form preserves modifica- 
tions like methylation, allowing them to be investigated 
directly and simultaneously.“ Another benefit of sin- 
gle-molecule RNA-Seq is that transcripts can be cov- 
ered in full length, allowing for higher confidence iso- 
form detection and quantification compared to short- 
read sequencing. Traditionally, single-molecule RNA- 
Seq methods have higher error rates compared to 
short-read sequencing, but newer methods like ONT di- 
rect RNA-Seq limit errors by avoiding fragmentation 
and cDNA conversion. Recent uses of ONT direct RNA- 
Seq for differential expression in human cell popula- 
tions have demonstrated that this technology can over- 
come many limitations of short and long cDNA se- 
quencing. 2! 


Single-cell RNA sequencing (scRNA-Seq) 
See also: w:Single cell sequencing 


Standard methods such as microarrays and standard 
bulk RNA-Seq analysis analyze the expression of RNAs 
from large populations of cells. In mixed cell popula- 
tions, these measurements may obscure critical differ- 
ences between individual cells within these popula- 
tions. 271241 
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Figure 1 | Summary of RNA-Seq. Within the organism, genes are transcribed and (in a eukaryotic organism) spliced to produce 
mature mRNA transcripts (red). The mRNA is extracted from the organism, fragmented and reverse-transcribed into stable dou- 
ble-stranded (ds) cDNA (blue). The ds-cDNA is sequenced using high-throughput, short-read sequencing methods. These se- 
quences can then be aligned to a reference genome sequence to reconstruct which genome regions were being transcribed. This 
data can be used to annotate where expressed genes are, their relative expression levels, and any alternative splice variants."7! 
CC BY 4.0 Thomas Shafee - Own work 


RNA selection and depletion methods:"! 


State Predominant Ribosomal Unprocessed Isolation 
By type of RNA RNA content RNA content method 

Total RNA All High High None 

PolyA Coding Low Low Hybridization 

selection with poly(dT) 

oligomers 

rRNA Coding, Low High Removal of oligomers 

depletion noncoding complementary to rRNA 
RNA capture Targeted Low Moderate Hybridization with probes 


complementary to desired 
transcripts 


Single-cell RNA sequencing (scRNA-Seq) provides the = Experimental procedures 
expression profiles of individual cells. Although it is not 
possible to obtain complete information on every RNA 
expressed by each cell, due to the small amount of ma- 
terial available, patterns of gene expression can be 
identified through gene clustering analyses. This can 
uncover the existence of rare cell types within a cell 
population that may never have been seen before. For 
example, rare specialized cells in the lung called pulmo- 
nary ionocytes that express the Cystic Fibrosis Trans- 
membrane Conductance Regulator were identified in 
2018 by two groups performing scRNA-Seq on lung air- 
way epithelia.7°176) 


Current scRNA-Seq protocols involve the following 
steps: isolation of single cell and RNA, reverse transcrip- 
tion (RT), amplification, library generation and se- 
quencing (Figure 3). Single cells are either mechanically 
separated into microwells (e.g., BD Rhapsody, Takara 
ICELL8, Vycap Puncher Platform, or CellMicrosystems 
CellRaft) or encapsulated in droplets (e.g., 10x Ge- 
nomics Chromium, Illumina Bio-Rad ddSEQ, 1CellBio 
InDrop, Dolomite Bio Nadia).!”! Single cells are labeled 
by adding beads with barcoded oligonucleotides; both 
cells and beads are supplied in limited amounts such 
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that co-occupancy with multiple cells and beads is a 
very rare event. Once reverse transcription is complete, 
the cDNAs from many cells can be mixed together for 
sequencing; transcripts from a particular cell are identi- 
fied by each cell's unique barcode.?*!29! Unique molec- 
ular identifier (UMIs) can be attached to mRNA/cDNA 
target sequences to help identify artifacts during library 
preparation. 20! 


Challenges for scRNA-Seq include preserving the initial 
relative abundance of mRNA in a cell and identifying 
rare transcripts.24 The reverse transcription step is crit- 
ical as the efficiency of the RT reaction determines how 
much of the cell’s RNA population will be eventually an- 
alyzed by the sequencer. The processivity of reverse 
transcriptases and the priming strategies used may af- 
fect full-length cDNA production and the generation of 
libraries biased toward the 3’ or 5' end of genes. 


In the amplification step, either PCR or in vitro tran- 
scription (IVT) is currently used to amplify cDNA. One of 
the advantages of PCR-based methods is the ability to 
generate full-length cDNA. However, different PCR ef- 
ficiency on particular sequences (for instance, GC con- 
tent and snapback structure) may also be exponentially 
amplified, producing libraries with uneven coverage. 
On the other hand, while libraries generated by IVT can 
avoid PCR-induced sequence bias, specific sequences 
may be transcribed inefficiently, thus causing sequence 
drop-out or generating incomplete sequences.%7I(73] 
Several scRNA-Seq protocols have been published: 
Tang et al.,°*! STRT,®“! SMART-seq,?*! CEL-seq,?® 
RAGE-seq,8”! Quartz-seq?*! and C1-CAGE.%! These 
protocols differ in terms of strategies for reverse tran- 
scription, CDNA synthesis and amplification, and the 
possibility to accommodate sequence-specific bar- 
codes (i.e. UMIs) or the ability to process pooled sam- 
ples.“° 


In 2017, two approaches were introduced to simultane- 
ously measure single-cell mRNA and protein expression 
through oligonucleotide-labeled antibodies known as 
REAP-seq,'*1! and CITE-seq. 2! 


Applications 


scRNA-Seq is becoming widely used across biological 
disciplines including Development, Neurology,'?! On- 
cology, {#1451461 Autoimmune disease,” and Infectious 
disease. !*® 


scRNA-Seq has provided considerable insight into the 
development of embryos and organisms, including the 
worm Caenorhabditis elegans,*! and the regenerative 
planarian Schmidtea mediterranea." The first verte- 
brate animals to be mapped in this way were 
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Zebrafish®43) and Xenopus laevis.*! In each case mul- 
tiple stages of the embryo were studied, allowing the 
entire process of development to be mapped on a cell- 
by-cell basis.'*! Science recognized these advances as 
the 2018 Breakthrough of the Year.!°! 
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Figure 2 | Typical RNA-Seq experimental workflow. RNA are 
isolated from multiple samples, converted to cDNA libraries, 
sequenced into a computer-readable format, aligned to a refer- 
ence, and quantified for downstream analyses such as differ- 
ential expression and alternative splicing.[8] 

Malachi Griffith et al., CC BY 2.5 


Single Cell RNA Sequencing Workflow 


RT& Second-strand 


GP<& ~ o- dy ~~ < 


Solid Tissue Dissociation Single Cell Isolation RNA cDNA 
WT 4 OR 
eo % Amplified ' s a 
RNA $35 
Cell Types a) 
Identification 
RT | PCR 
t Clustering hey, 
r 5 a, , ary 
Single-cell Sequencing Sequencing Library Amplified cDNA 


Expression Profiles 


Figure 3 | Typical single-cell RNA-Seq workflow. Single 
cells are isolated from a sample into either wells or drop- 
lets, CDNA libraries are generated and amplified, librar- 
ies are sequenced, and expression matrices are gener- 
ated for downstream analyses like cell type identifica- 
tion. Yijyechern, CC BY-SA 


Experimental considerations 


A variety of parameters are considered when designing 
and conducting RNA-Seq experiments: 


e Tissue specificity: Gene expression varies 
within and between tissues, and RNA-Seq 
measures this mix of cell types. This may 
make it difficult to isolate the biological 


mechanism of interest. Single cell sequencing 
can be used to study each cell individually, 
mitigating this issue. 


e Time dependence: Gene expression changes 
over time, and RNA-Seq only takes a snapshot. 
Time course experiments can be performed to 
observe changes in the transcriptome. 


e Coverage (also known as depth): RNA harbors 
the same mutations observed in DNA, and de- 
tection requires deeper coverage. With high 
enough coverage, RNA-Seq can be used to es- 
timate the expression of each allele. This may 
provide insight into phenomena such as im- 
printing or cis-regulatory effects. The depth of 
sequencing required for specific applications 
can be extrapolated from a pilot experi- 
ment.4 


e Data generation artifacts (also known as tech- 
nical variance): The reagents (e.g., library prep- 
aration kit), personnel involved, and type of se- 
quencer (e.g., Illumina, Pacific Biosciences) 
can result in technical artifacts that might be 
mis-interpreted as meaningful results. As with 
any scientific experiment, it is prudent to con- 
duct RNA-Seq in a well controlled setting. If 
this is not possible or the study is a meta-anal- 
ysis, another solution is to detect technical ar- 
tifacts by inferring latent variables (typically 
principal component analysis or factor analy- 
sis) and subsequently correcting for these var- 
iables. 57! 


e Data management: A single RNA-Seq experi- 
ment in humans is usually 1-5 Gb (com- 
pressed) or more when including intermediate 
files.>®! This large volume of data can pose 
storage issues. One solution is compressing 
the data using multi-purpose computational 
schemas (e.g., gzip) or genomics-specific 
schemas. The latter can be based on refer- 
ence sequences or de novo. Another solution 
is to perform microarray experiments, which 
may be sufficient for hypothesis-driven work 
or replication studies (as opposed to explora- 
tory research). 


Results / Description 
See also: w:List of RNA-Seq bioinformatics tools 


An overview of RNA-Seq analysis techniques is illus- 
trated in Figure 4. 


5 of 14 | WikiJournal of Science 


WikiJournal of Science, 2021, 4(1):3 s - 
doi: 10.15347/wjs/2021.004 mo} y 
Review Article Wl 


Transcriptome assembly 


See also: w:Sequence alignment software § Short-Read 
Sequence Alignment 


Two methods are used to assign raw sequence reads to 
genomic features (i.e., assemble the transcriptome): 


De novo: This approach does not require a ref- 
erence genome to reconstruct the transcrip- 
tome, and is typically used if the genome is 
unknown, incomplete, or substantially altered 
compared to the reference.>*! Challenges 
when using short reads for de novo assembly 
include 1) determining which reads should be 
joined together into contiguous sequences 
(contigs), 2) robustness to sequencing errors 
and other artifacts, and 3) computational effi- 
ciency. The primary algorithm used for de 
novo assembly transitioned from overlap 
graphs, which identify all pair-wise overlaps 
between reads, to de Bruijn graphs, which 
break reads into sequences of length k and 
collapse all k-mers into a hash table.'®"! Over- 
lap graphs were used with Sanger sequencing, 
but do not scale well to the millions of reads 
generated with RNA-Seq. Examples of assem- 
blers that use de Bruijn graphs are Trinity, °°! 
Oases!*! (derived from the genome assem- 
bler Velvet'©!), Bridger, "7! and rnaSPAdes.!%*! 
Paired-end and long-read sequencing of the 
same sample can mitigate the deficits in short 
read sequencing by serving as a template or 
skeleton. Metrics to assess the quality of ade 
novo assembly include median contig length, 
number of contigs and N50.'°°! 


Genome guided: This approach relies on the 
same methods used for DNA alignment, with 
the additional complexity of aligning reads 
that cover non-continuous portions of the ref- 
erence genome.!®! These non-continuous 
reads are the result of sequencing spliced 
transcripts (Figure 5). Typically, alignment al- 
gorithms have two steps: 1) align short por- 
tions of the read (i.e., seed the genome), and 
2) use dynamic programming to find an opti- 
mal alignment, sometimes in combination 
with known annotations. Software tools that 
use genome-guided alignment include Bow- 
tie, © TopHat (which builds on BowTie results 
to align splice junctions), '°°""! Subread, 
STAR, 6° HISAT2, 72) and GMAP.!”2! The out- 
put of genome guided alignment (mapping) 
tools can be further utilized by tools such as 


Cufflinks’?! or StringTie!”“! to reconstruct con- 
tiguous transcript sequences (i.e., aFASTA 
file). The quality of a genome guided assembly 
can be measured with both 1) de novo assem- 
bly metrics (e.g., N50) and 2) comparisons to 
known transcript, splice junction, genome, 
and protein sequences using precision, recall, 
or their combination (e.g., F1 score).!©*! In ad- 
dition, in silico assessment could be per- 
formed using simulated reads.!*ll’6 


A note on assembly quality: The current consensus is 
that 1) assembly quality can vary depending on which 
metric is used, 2) assembly tools that scored well in one 
species do not necessarily perform well in the other spe- 
cies, and 3) combining different approaches might be 
the most reliable. !771(781(79] 
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Figure 4 | Diagram outlining a standard RNA-Seq analy- 

sis workflow. Sequenced reads are aligned to a refer- 
ence genome and/or transcriptome and subsequently 
processed for a variaty of quality control, discovery, and 
hypothesis-driven analyses. CC BY-SA 4.0 Salubrious 
Toxin - Own work 


Gene expression quantification 


Expression is quantified to study cellular changes in re- 
sponse to external stimuli, differences between 
healthy and diseased states, and other research ques- 
tions. Transcript levels are often used as a proxy for 
protein abundance, but these are often not equivalent 
due to post transcriptional events such as RNA inter- 
ference and nonsense-mediated decay.®”! 


Expression is quantified by counting the number of 
reads that mapped to each locus in the transcriptome 
assembly step. Expression can be quantified for exons 
or genes using contigs or reference transcript annota- 
tions.'°! These observed RNA-Seq read counts have 
been robustly validated against older technologies, in- 
cluding expression microarrays and qgPCR.44l84 Tools 
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that quantify counts are HTSeq,*"! FeatureCounts, ®?! 


Recount, ©“! maxcounts, 5! FIXSEO, ®* and Cuffquant. 
These tools determine read counts from aligned RNA- 
Seq data, but alignment-free counts can also be ob- 
tained with Sailfish'®”! and Kallisto.”®! The read counts 
are then converted into appropriate metrics for hy- 
pothesis testing, regressions, and other analyses. Pa- 
rameters for this conversion are: 


e Sequencing depth/coverage: Although depth is 
pre-specified when conducting multiple RNA- 
Seq experiments, it will still vary widely be- 
tween experiments.®*! Therefore, the total 
number of reads generated in a single experi- 
ment is typically normalized by converting 
counts to fragments, reads, or counts per mil- 
lion mapped reads (FPM, RPM, or CPM). The 
difference between RPM and FPM was histori- 
cally derived during the evolution from single- 
end sequencing of fragments to paired-end se- 
quencing. In single-end sequencing, there is 
only one read per fragment (i.e., RPM = FPM). 
In paired-end sequencing, there are two reads 
per fragment (i.e., RPM =2xFPM). Sequencing 
depth is sometimes referred to as library size, 
the number of intermediary cDNA molecules 
in the experiment. 


e Gene length: Longer genes will have more frag- 
ments/reads/counts than shorter genes if tran- 
script expression is the same. This is adjusted 
by dividing the FPM by the length of a feature 
(which can be a gene, transcript, or exon), re- 
sulting in the metric fragments per kilobase of 
feature per million mapped reads (FPKM).!°° 
When looking at groups of features across 
samples, FPKM is converted to transcripts per 
million (TPM) by dividing each FPKM by the 
sum of FPKMs within a sample, 221921193] 


e Total sample RNA output: Because the same 
amount of RNA is extracted from each sample, 
samples with more total RNA will have less 
RNA per gene. These genes appear to have de- 
creased expression, resulting in false positives 
in downstream analyses.*! 


e Variance for each gene's expression: is modeled 
to account for sampling error (important for 
genes with low read counts), increase power, 
and decrease false positives. Variance can be 
estimated as a normal, Poisson, or negative bi- 
nomial distribution! and is frequently 
decomposed into technical and biological var- 
iance. 


Spike-ins for absolute quantification and detec- 
tion of genome-wide effects 


RNA spike-ins are samples of RNA at known con- 
centrations that can be used as gold standards in 
experimental design and during downstream anal- 
yses for absolute quantification and detection of 
genome-wide effects. 


e Absolute quantification: Absolute quantifica- 
tion of gene expression is not possible with 
most RNA-Seq experiments, which quantify 
expression relative to all transcripts. It is possi- 
ble by performing RNA-Seq with spike-ins, 
samples of RNA at known concentrations. Af- 
ter sequencing, read counts of spike-in se- 
quences are used to determine the relation- 
ship between each gene's read counts and ab- 
solute quantities of biological fragments!?!!%7! 
In one example, this technique was used in 
Xenopus tropicalis embryos to determine tran- 
scription kinetics. !°" 


e Detection of genome-wide effects: Changes in 
global regulators including chromatin remod- 
elers, transcription factors (e.g., MYC), acetyl- 
transferase complexes, and nucleosome posi- 
tioning are not congruent with normalization 
assumptions and spike-in controls can offer 
precise interpretation.2910001 
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Figure 5 | RNA-Seq alignment with intron-split short 
reads. In alignment of short reads to an MRNA sequence 
and the reference genome, alignment software has to ac- 
count for short reads that overlap exon-exon junctions (in 
red) and thereby skip intronic sections of the pre-mRNA 
and reference genome. Rgocs, CC BY 
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Differential expression 


The simplest but often most powerful use of RNA-Seq 
is finding differences in gene expression between two 
or more conditions (e.g., treated vs not treated); this 
process is called differential expression. The outputs 
are frequently referred to as differentially expressed 
genes (DEGs) and these genes can either be up- or 
down-regulated (i.e., higher or lower in the condition 
of interest). There are many tools that perform differ- 
ential expression. Most are run in R, Python, orthe 
Unix command line. Commonly used tools include 
DESeq,4°1 edgeR, 4°"! and voom+limma, 272%] all of 
which are available through R/Bioconductor. 21105) 
These are the common considerations when perform- 
ing differential expression: 


e Inputs: Differential expression inputs include 
(1) an RNA-Seq expression matrix (M genes x 
N samples) and (2) a design matrix containing 
experimental conditions for N samples. The 
simplest design matrix contains one column, 
corresponding to labels forthe condition being 
tested. Other covariates (also referred to as 
factors, features, labels, or parameters) can in- 
clude batch effects, known artifacts, and any 
metadata that might confound or mediate 
gene expression. In addition to known covari- 
ates, unknown covariates can also be esti- 
mated through unsupervised machine learning 
approaches including principal component, 
surrogate variable,“°” and PEER"! analyses. 
Hidden variable analyses are often employed 
for human tissue RNA-Seq data, which typi- 
cally have additional artifacts not captured in 
the metadata (e.g., ischemic time, sourcing 
from multiple institutions, underlying clinical 
triats, collecting data across many years with 
many personnel). 


e Methods: Most tools use regression or non- 
parametric statistics to identify differentially 
expressed genes, and are either based on read 
counts mapped to a reference genome 
(DESeq2, limma, edgeR) or based on read 
counts derived from alignment-free quantifi- 
cation (sleuth, "0%! Cuffdiff, 1°! 
Ballgown"™4).!171 Following regression, most 
tools employ either familywise error rate 
(FWER) or false discovery rate (FDR) p-value 
adjustments to account for multiple hypothe- 
ses (in human studies, ~20,000 protein-coding 
genes or ~50,000 biotypes). 


e Outputs: A typical output consists of rows cor- 
responding to the number of genes and at 


least three columns, each gene's log fold 
change (log-transform of the ratio in expres- 
sion between conditions, a measure of effect 
size), p-value, and p-value adjusted for multi- 
ple comparisons. Genes are defined as biolog- 
ically meaningful if they pass cut-offs for effect 
size (log fold change) and statistical signifi- 
cance. These cut-offs should ideally be speci- 
fied a priori, but the nature of RNA-Seq experi- 
ments is often exploratory so it is difficult to 
predict effect sizes and pertinent cut-offs 
ahead of time. 


e = Pitfalls: The raison d'etre for these complex 
methods is to avoid the myriad of pitfalls that 
can lead to statistical errors and misleading 
interpretations. Pitfalls include increased false 
positive rates (due to multiple comparisons), 
sample preparation artifacts, sample hetero- 
geneity (like mixed genetic backgrounds), 
highly correlated samples, unaccounted for 
multi-level experimental designs, and poor 
experimental design. One notable pitfall is 
viewing results in Microsoft Excel."**! Alt- 
hough convenient, Excel automatically con- 
verts some gene names (SEPT1, DEC1, 
MARCH2) into dates or floating point num- 
bers. 


e Choice of tools and benchmarking: There are 
numerous efforts that compare the results of 
these tools, with DESeq2 tending to moder- 
ately outperform other meth- 
ods, 214](115][116)[117][118}[119][120] As with other 
methods, benchmarking consists of compar- 
ing tool outputs to each other and known gold 
standards. 


Downstream analyses for a list of differentially ex- 
pressed genes come in two flavors, validating observa- 
tions and making biological inferences. Owing to the 
pitfalls of differential expression and RNA-Seq, im- 
portant observations are replicated with (1) an orthog- 
onal method in the same samples (like real-time PCR) 
or (2) another, sometimes pre-registered, experiment 
in anew cohort. The latter helps ensure generalizability 
and can typically be followed up with a meta-analysis of 
all the pooled cohorts. The most common method for 
obtaining higher-level biological understanding of the 
results is gene set enrichment analysis, although some- 
times candidate gene approaches are employed. Gene 
set enrichment determines if the overlap between two 
gene sets is statistically significant, in this case the over- 
lap between differentially expressed genes and gene 
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sets from known pathways/databases (e.g., Gene On- 
tology, KEGG, Human Phenotype Ontology) or from 
complementary analyses in the same data (like co-ex- 
pression networks). Common tools for gene set enrich- 
ment include web interfaces (e.g., ENRICHR, g:profiler, 
WEBGESTALT)"?4 and software packages. When eval- 
vating enrichment results, one heuristic is to first look 
for enrichment of known biology as a sanity check and 
then expand the scope to look for novel biology. 


Alternative splicing 


RNA splicing is integral to eukaryotes and contributes 
significantly to protein regulation and diversity, occur- 
ring in >90% of human genes."”2! There are multiple al- 
ternative splicing modes: exon skipping (most common 
splicing mode in humans and higher eukaryotes), mutu- 
ally exclusive exons, alternative donor or acceptor sites, 
intron retention (most common splicing mode in plants, 
fungi, and protozoa), alternative transcription start site 
(promoter), and alternative polyadenylation (Figure 
6)."23] One goal of RNA-Seq is to identify alternative 
splicing events and test if they differ between condi- 
tions. Long-read sequencing captures the full transcript 
and thus minimizes many of issues in estimating iso- 
form abundance, like ambiguous read mapping. For 
short-read RNA-Seq, there are multiple methods to de- 
tect alternative splicing that can be classified into three 
main groups: !241l12511126] 


e Count-based (also event-based, differential 
splicing): estimate exon retention. Examples 
are DEXSeq,"77, MATS"78_ and SeqGSEA!29], 


e |soform-based (also multi-read modules, differ- 
ential isoform expression): estimate isoform 
abundance first, and then relative abundance 
between conditions. Examples are Cufflinks 
2!3°] and DiffSplice!*"). 


e = Intron excision based: calculate alternative 
splicing using split reads. Examples are 
MAJSIO"*2] and Leafcutter??). 


Differential gene expression tools can also be used for 
differential isoform expression if isoforms are quanti- 
fied ahead of time with other tools like RSEM."*! 


Coexpression networks 


Coexpression networks are data-derived representa- 
tions of genes behaving ina similar way across tissues 
and experimental conditions.'**! Their main purpose 
lies in hypothesis generation and guilt-by-association 
approaches for inferring functions of previously un- 
known genes."**! RNA-Seq data has been used to infer 
genes involved in specific pathways based on Pearson 
correlation, both in plants'7*! and mammals."*”! The 


fC) 


main advantage of RNA-Seq data in this kind of analy- 
sis over the microarray platforms is the capability to 
cover the entire transcriptome, therefore allowing the 
possibility to unravel more complete representations 
of the gene regulatory networks. Differential regula- 
tion of the splice isoforms of the same gene can be de- 
tected and used to predict their biological func- 

tions. 77810391 Weighted gene co-expression network 
analysis has been successfully used to identify co-ex- 
pression modules and intramodular hub genes based 
on RNA seq data. Co-expression modules may corre- 
spond to cell types or pathways. Highly connected in- 
tramodular hubs can be interpreted as representatives 
of their respective module. An eigengene is a weighted 
sum of expression of all genes in a module. Eigengenes 
are useful biomarkers (features) for diagnosis and 
prognosis." Variance-Stabilizing Transformation ap- 
proaches for estimating correlation coefficients based 
on RNA seq data have been proposed. !?6! 


Exon skipping 


Mutually exclusive exons 


! 


Alternative 5° donor sites 


. 


Alternative 3 acceptor sites 


Intron retention 


Figure 6 | Examples of alternative RNA splicing modes. 
Exons are represented as blue and yellow blocks, spliced 
introns as horizontal black lines connecting two exons, 
and exon-exon junctions as thin grey connecting lines 
between two exons. 

Allen Gathman, CC BY-SA 
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RNA-Seq captures DNA variation, including single nu- 
cleotide variants, small insertions/deletions. and struc- 
tural variation. Variant calling in RNA-Seq is similar to 
DNA variant calling and often employs the same tools 
(including SAMtools mpileup"*2) and GATK Haplotype- 
Caller!*2!) with adjustments to account for splicing. One 
unique dimension for RNA variants is allele-specific ex- 
pression (ASE): the vairants from only one haplotype 
might be preferentially expressed due to regulatory ef- 
fects including imprinting and expression quantitative 
trait loci, and noncoding rare variants./7!0441 Ljmita- 
tions of RNA variant identification include that it only 
reflects expressed regions (in humans, <5% of the ge- 
nome), could be subject to biases introduced by data 
processing (e.g., de novo transcriptome assemblies un- 
derestimate heterozygosity"**!), and has lower quality 
when compared to direct DNA sequencing. 


RNA editing (post-transcriptional alterations) 
See also: w:RNA editing 


Having the matching genomic and transcriptomic se- 
quences of an individual can help detect post-tran- 
scriptional edits (RNA editing)."7! A post-transcriptional 
modification event is identified if the gene's transcript 
has an allele/variant not observed in the genomic data. 


pre-mRNA 


pre-mRNA 
(Gene 1) 


(Gene 2) 


Gene 
Junction 


End-paired 
short reads 


D> 
a) Trans, . .  b)Cis. 


Figure 7 | Gene fusion event. A gene fusion event and 
the behaviour of paired-end reads falling on both sides of 
the gene union. Gene fusions can occur in Trans, be- 
tween genes on separate chromosomes, or in Cis, be- 
tween two genes on the same chromosome. 

Rgocs, CC BY 
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Fusion gene detection 


See also: w:Fusion gene 


Caused by different structural modifications in the ge- 
nome, fusion genes have gained attention because of 
their relationship with cancer.“ The ability of RNA- 
Seq to analyze a sample's whole transcriptome in an un- 
biased fashion makes it an attractive tool to find these 
kinds of common events in cancer.!! 


The idea follows from the process of aligning the short 
transcriptomic reads to a reference genome. Most of 
the short reads will fall within one complete exon, anda 
smaller but still large set would be expected to map to 
known exon-exon junctions. The remaining unmapped 
short reads would then be further analyzed to deter- 
mine whether they match an exon-exon junction where 
the exons come from different genes. This would be ev- 
idence of a possible fusion event, however, because of 
the length of the reads, this could prove to be very 
noisy. An alternative approach is to use paired-end 
reads, when a potentially large number of paired reads 
would map each end to a different exon, giving better 
coverage of these events (Figure 7). Nonetheless, the 
end result consists of multiple and potentially novel 
combinations of genes providing an ideal starting point 
for further validation. 


Discussion 


RNA-Seq and other NGS-based methods have flour- 
ished over the last decade. The number of manuscripts 
referring to RNA-Seq in the title or abstract (Figure 8, 
blue line) is continuously increasing with 6754 manvu- 
scripts published in 2018 (link to PubMed search). The 
intersection of RNA-Seq and medicine (Figure 8, gold 
line, link to PubMed search) has similar celerity. 


Applications to medicine 


RNA-Seg has the potential to identify new disease biol- 
ogy, profile biomarkers for clinical indications, infer 
druggable pathways, and make genetic diagnoses. 
These results could be further personalized for sub- 
groups or even individual patients, potentially high- 
lighting more effective prevention, diagnostics, and 
therapy. The feasibility of this approach is in part dic- 
tated by costs in money and time; a related limitation is 
the required team of specialists (bioinformaticians, 
physicians/clinicians, basic researchers, technicians) to 
fully interpret the huge amount of data generated by 
this analysis. 
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Search 
=m RNAseq 


000 me RNAseqg and medicine 


Manuscripts 


Figure 8 | Pubmed manuscript matches highlight the growing 
popularity of RNA-Seq. Matches are for RNA-Seq (blue, search 
terms: "RNA Seq" OR "RNA-Seq" OR "RNA sequencing" OR 
"RNASeq") and RNA=Seq in medicine (gold, search terms: 
("RNA Seg" OR "RNA-Seq" OR "RNA sequencing" OR 
"RNASeq") AND "Medicine"). The number of manuscripts on 
PubMed featuring RNA-Seq is still increasing. CC BY-SA 4.0 
Salubrious Toxin - Own work 


Large-scale sequencing efforts 


A lot of emphasis has been given to RNA-Seq data after 
the Encyclopedia of DNA Elements (ENCODE) and The 
Cancer Genome Atlas (TCGA) projects have used this 
approach to characterize dozens of cell lines“) and 
thousands of primary tumor samples,"**! respectively. 
ENCODE aimed to identify genome-wide regulatory re- 
gions in different cohort of cell lines and transcriptomic 
data are paramount in order to understand the down- 
stream effect of those epigenetic and genetic regula- 
tory layers. TCGA, instead, aimed to collect and analyze 
thousands of patient's samples from 30 different tumor 
types in order to understand the underlying mecha- 
nisms of malignant transformation and progression. In 
this context RNA-Seq data provide a unique snapshot 
of the transcriptomic status of the disease and look at 
an unbiased population of transcripts that allows the 
identification of novel transcripts, fusion transcripts 
and non-coding RNAs that could be undetected with 
different technologies. 
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